TP SP examples improvement #1354
Conversation
output.sum().backward()
optimizer.step()
inp = torch.rand(4, 10, device=device_type)
comm_mode = CommDebugMode()
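For context, a minimal sketch of how CommDebugMode is typically wrapped around a training step (assuming the example's tp_model, optimizer, and device_type are already set up; the import path can vary across PyTorch versions):

```python
import torch
from torch.distributed.tensor.debug import CommDebugMode

# Assumes tp_model, optimizer, and device_type are set up as in the example.
inp = torch.rand(4, 10, device=device_type)
comm_mode = CommDebugMode()
with comm_mode:
    output = tp_model(inp)
    output.sum().backward()
optimizer.step()

# The counters and sharding info in the logs below come from calls like these:
print(comm_mode.get_comm_counts())
print(comm_mode.get_sharding_info())
print(comm_mode.generate_comm_debug_tracing_table(noise_level=1))
```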
Does this work on non-CUDA devices? It would be great to share some local logs of your tests.
Gladly. Please see the attached logs from an H100 run.
Starting PyTorch TP example on rank 3.
Starting PyTorch TP example on rank 0.
06/16/2025 05:55:00 PM Device Mesh created: device_mesh=DeviceMesh('cuda', [0, 1, 2, 3])
Starting PyTorch TP example on rank 2.
Starting PyTorch TP example on rank 1.
model ToyModel(
(in_proj): Linear(in_features=10, out_features=32, bias=True)
(relu): ReLU()
(out_proj): Linear(in_features=32, out_features=5, bias=True)
)
06/16/2025 05:55:03 PM Tensor Parallel training starting...
06/16/2025 05:55:03 PM Tensor Parallel iter 0 completed
rank3 1 get_comm_counts defaultdict(<class 'int'>, {<OpOverloadPacket(op='c10d_functional.all_reduce')>: 1}) get_sharding_info() {'ToyModel.in_proj.weight': (Shard(dim=0),), 'ToyModel.in_proj.bias': (Shard(dim=0),), 'ToyModel.out_proj.weight': (Shard(dim=1),), 'ToyModel.out_proj.bias': (Replicate(),)} generate_comm_debug_tracing_table Global
FORWARD PASS
*c10d_functional.all_reduce: 1
BACKWARD PASS
ToyModel
*module type: class '__main__.ToyModel'
FORWARD PASS
*c10d_functional.all_reduce: 1
ToyModel.in_proj
*module type: class 'torch.nn.modules.linear.Linear'
*Parameter List
*weight: (Shard(dim=0),)
*bias: (Shard(dim=0),)
FORWARD PASS
**aten.addmm.default
shape: [torch.Size([32]), torch.Size([4, 10]), torch.Size([10, 32])]
sharding: [(Shard(dim=0),), (Replicate(),), (Shard(dim=1),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
BACKWARD PASS
**aten.mm.default
shape: [torch.Size([32, 4]), torch.Size([4, 10])]
sharding: [(Shard(dim=0),), (Replicate(),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.sum.dim_IntList
shape: [torch.Size([4, 32])]
sharding: [(Shard(dim=1),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.add_.Tensor
shape: [torch.Size([32]), torch.Size([32])]
sharding: [(Shard(dim=0),), (Shard(dim=0),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.add_.Tensor
shape: [torch.Size([32, 10]), torch.Size([32, 10])]
sharding: [(Shard(dim=0),), (Shard(dim=0),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
ToyModel.relu
*module type: class 'torch.nn.modules.activation.ReLU'
FORWARD PASS
BACKWARD PASS
ToyModel.out_proj
*module type: class 'torch.nn.modules.linear.Linear'
*Parameter List
*weight: (Shard(dim=1),)
*bias: (Replicate(),)
FORWARD PASS
*c10d_functional.all_reduce: 1
**aten.addmm.default
shape: [torch.Size([5]), torch.Size([4, 32]), torch.Size([32, 5])]
sharding: [(Replicate(),), (Shard(dim=1),), (Shard(dim=0),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
BACKWARD PASS
**aten.mm.default
shape: [torch.Size([4, 5]), torch.Size([5, 32])]
sharding: [(Replicate(),), (Shard(dim=1),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.mm.default
shape: [torch.Size([5, 4]), torch.Size([4, 32])]
sharding: [(Replicate(),), (Shard(dim=1),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.sum.dim_IntList
shape: [torch.Size([4, 5])]
sharding: [(Replicate(),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.add_.Tensor
shape: [torch.Size([5]), torch.Size([5])]
sharding: [(Replicate(),), (Replicate(),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.add_.Tensor
shape: [torch.Size([5, 32]), torch.Size([5, 32])]
sharding: [(Shard(dim=1),), (Shard(dim=1),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
06/16/2025 05:55:03 PM Tensor Parallel iter 1 completed
rank0 1 get_comm_counts defaultdict(<class 'int'>, {<OpOverloadPacket(op='c10d_functional.all_reduce')>: 1}) get_sharding_info() {'ToyModel.in_proj.weight': (Shard(dim=0),), 'ToyModel.in_proj.bias': (Shard(dim=0),), 'ToyModel.out_proj.weight': (Shard(dim=1),), 'ToyModel.out_proj.bias': (Replicate(),)} generate_comm_debug_tracing_table Global
FORWARD PASS
*c10d_functional.all_reduce: 1
BACKWARD PASS
ToyModel
*module type: class '__main__.ToyModel'
FORWARD PASS
*c10d_functional.all_reduce: 1
ToyModel.in_proj
*module type: class 'torch.nn.modules.linear.Linear'
*Parameter List
*weight: (Shard(dim=0),)
*bias: (Shard(dim=0),)
FORWARD PASS
**aten.addmm.default
shape: [torch.Size([32]), torch.Size([4, 10]), torch.Size([10, 32])]
sharding: [(Shard(dim=0),), (Replicate(),), (Shard(dim=1),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
BACKWARD PASS
**aten.mm.default
shape: [torch.Size([32, 4]), torch.Size([4, 10])]
sharding: [(Shard(dim=0),), (Replicate(),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.sum.dim_IntList
shape: [torch.Size([4, 32])]
sharding: [(Shard(dim=1),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.add_.Tensor
shape: [torch.Size([32]), torch.Size([32])]
sharding: [(Shard(dim=0),), (Shard(dim=0),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.add_.Tensor
shape: [torch.Size([32, 10]), torch.Size([32, 10])]
sharding: [(Shard(dim=0),), (Shard(dim=0),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
ToyModel.relu
*module type: class 'torch.nn.modules.activation.ReLU'
FORWARD PASS
BACKWARD PASS
ToyModel.out_proj
*module type: class 'torch.nn.modules.linear.Linear'
*Parameter List
*weight: (Shard(dim=1),)
*bias: (Replicate(),)
FORWARD PASS
*c10d_functional.all_reduce: 1
**aten.addmm.default
shape: [torch.Size([5]), torch.Size([4, 32]), torch.Size([32, 5])]
sharding: [(Replicate(),), (Shard(dim=1),), (Shard(dim=0),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
BACKWARD PASS
**aten.mm.default
shape: [torch.Size([4, 5]), torch.Size([5, 32])]
sharding: [(Replicate(),), (Shard(dim=1),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.mm.default
shape: [torch.Size([5, 4]), torch.Size([4, 32])]
sharding: [(Replicate(),), (Shard(dim=1),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.sum.dim_IntList
shape: [torch.Size([4, 5])]
sharding: [(Replicate(),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.add_.Tensor
shape: [torch.Size([5]), torch.Size([5])]
sharding: [(Replicate(),), (Replicate(),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.add_.Tensor
shape: [torch.Size([5, 32]), torch.Size([5, 32])]
sharding: [(Shard(dim=1),), (Shard(dim=1),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
rank2 1 get_comm_counts defaultdict(<class 'int'>, {<OpOverloadPacket(op='c10d_functional.all_reduce')>: 1}) get_sharding_info() {'ToyModel.in_proj.weight': (Shard(dim=0),), 'ToyModel.in_proj.bias': (Shard(dim=0),), 'ToyModel.out_proj.weight': (Shard(dim=1),), 'ToyModel.out_proj.bias': (Replicate(),)} generate_comm_debug_tracing_table Global
FORWARD PASS
*c10d_functional.all_reduce: 1
BACKWARD PASS
ToyModel
*module type: class '__main__.ToyModel'
FORWARD PASS
*c10d_functional.all_reduce: 1
ToyModel.in_proj
*module type: class 'torch.nn.modules.linear.Linear'
*Parameter List
*weight: (Shard(dim=0),)
*bias: (Shard(dim=0),)
FORWARD PASS
**aten.addmm.default
shape: [torch.Size([32]), torch.Size([4, 10]), torch.Size([10, 32])]
sharding: [(Shard(dim=0),), (Replicate(),), (Shard(dim=1),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
BACKWARD PASS
**aten.mm.default
shape: [torch.Size([32, 4]), torch.Size([4, 10])]
sharding: [(Shard(dim=0),), (Replicate(),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.sum.dim_IntList
shape: [torch.Size([4, 32])]
sharding: [(Shard(dim=1),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.add_.Tensor
shape: [torch.Size([32]), torch.Size([32])]
sharding: [(Shard(dim=0),), (Shard(dim=0),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.add_.Tensor
shape: [torch.Size([32, 10]), torch.Size([32, 10])]
sharding: [(Shard(dim=0),), (Shard(dim=0),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
ToyModel.relu
*module type: class 'torch.nn.modules.activation.ReLU'
FORWARD PASS
BACKWARD PASS
ToyModel.out_proj
*module type: class 'torch.nn.modules.linear.Linear'
*Parameter List
*weight: (Shard(dim=1),)
*bias: (Replicate(),)
FORWARD PASS
*c10d_functional.all_reduce: 1
**aten.addmm.default
shape: [torch.Size([5]), torch.Size([4, 32]), torch.Size([32, 5])]
sharding: [(Replicate(),), (Shard(dim=1),), (Shard(dim=0),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
BACKWARD PASS
**aten.mm.default
shape: [torch.Size([4, 5]), torch.Size([5, 32])]
sharding: [(Replicate(),), (Shard(dim=1),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.mm.default
shape: [torch.Size([5, 4]), torch.Size([4, 32])]
sharding: [(Replicate(),), (Shard(dim=1),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.sum.dim_IntList
shape: [torch.Size([4, 5])]
sharding: [(Replicate(),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.add_.Tensor
shape: [torch.Size([5]), torch.Size([5])]
sharding: [(Replicate(),), (Replicate(),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.add_.Tensor
shape: [torch.Size([5, 32]), torch.Size([5, 32])]
sharding: [(Shard(dim=1),), (Shard(dim=1),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
rank1 1 get_comm_counts defaultdict(<class 'int'>, {<OpOverloadPacket(op='c10d_functional.all_reduce')>: 1}) get_sharding_info() {'ToyModel.in_proj.weight': (Shard(dim=0),), 'ToyModel.in_proj.bias': (Shard(dim=0),), 'ToyModel.out_proj.weight': (Shard(dim=1),), 'ToyModel.out_proj.bias': (Replicate(),)} generate_comm_debug_tracing_table Global
FORWARD PASS
*c10d_functional.all_reduce: 1
BACKWARD PASS
ToyModel
*module type: class '__main__.ToyModel'
FORWARD PASS
*c10d_functional.all_reduce: 1
ToyModel.in_proj
*module type: class 'torch.nn.modules.linear.Linear'
*Parameter List
*weight: (Shard(dim=0),)
*bias: (Shard(dim=0),)
FORWARD PASS
**aten.addmm.default
shape: [torch.Size([32]), torch.Size([4, 10]), torch.Size([10, 32])]
sharding: [(Shard(dim=0),), (Replicate(),), (Shard(dim=1),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
BACKWARD PASS
**aten.mm.default
shape: [torch.Size([32, 4]), torch.Size([4, 10])]
sharding: [(Shard(dim=0),), (Replicate(),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.sum.dim_IntList
shape: [torch.Size([4, 32])]
sharding: [(Shard(dim=1),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.add_.Tensor
shape: [torch.Size([32]), torch.Size([32])]
sharding: [(Shard(dim=0),), (Shard(dim=0),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.add_.Tensor
shape: [torch.Size([32, 10]), torch.Size([32, 10])]
sharding: [(Shard(dim=0),), (Shard(dim=0),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
ToyModel.relu
*module type: class 'torch.nn.modules.activation.ReLU'
FORWARD PASS
BACKWARD PASS
ToyModel.out_proj
*module type: class 'torch.nn.modules.linear.Linear'
*Parameter List
*weight: (Shard(dim=1),)
*bias: (Replicate(),)
FORWARD PASS
*c10d_functional.all_reduce: 1
**aten.addmm.default
shape: [torch.Size([5]), torch.Size([4, 32]), torch.Size([32, 5])]
sharding: [(Replicate(),), (Shard(dim=1),), (Shard(dim=0),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
BACKWARD PASS
**aten.mm.default
shape: [torch.Size([4, 5]), torch.Size([5, 32])]
sharding: [(Replicate(),), (Shard(dim=1),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.mm.default
shape: [torch.Size([5, 4]), torch.Size([4, 32])]
sharding: [(Replicate(),), (Shard(dim=1),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.sum.dim_IntList
shape: [torch.Size([4, 5])]
sharding: [(Replicate(),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.add_.Tensor
shape: [torch.Size([5]), torch.Size([5])]
sharding: [(Replicate(),), (Replicate(),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.add_.Tensor
shape: [torch.Size([5, 32]), torch.Size([5, 32])]
sharding: [(Shard(dim=1),), (Shard(dim=1),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
06/16/2025 05:55:03 PM Tensor Parallel iter 2 completed
06/16/2025 05:55:03 PM Tensor Parallel iter 3 completed
06/16/2025 05:55:03 PM Tensor Parallel iter 4 completed
06/16/2025 05:55:03 PM Tensor Parallel iter 5 completed
06/16/2025 05:55:03 PM Tensor Parallel iter 6 completed
06/16/2025 05:55:04 PM Tensor Parallel iter 7 completed
06/16/2025 05:55:04 PM Tensor Parallel iter 8 completed
06/16/2025 05:55:04 PM Tensor Parallel iter 9 completed
06/16/2025 05:55:04 PM Tensor Parallel training completed!
[rank0]:[W616 17:55:04.791527408 ProcessGroupNCCL.cpp:1516] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Starting PyTorch Sequence Parallel example on rank 0.
06/16/2025 05:53:21 PM Device Mesh created: device_mesh=DeviceMesh('cuda', [0, 1, 2, 3])
Starting PyTorch Sequence Parallel example on rank 3.
Starting PyTorch Sequence Parallel example on rank 2.
Starting PyTorch Sequence Parallel example on rank 1.
model ToyModel(
(in_proj): Linear(in_features=10, out_features=32, bias=True)
(relu): ReLU()
(out_proj): Linear(in_features=32, out_features=5, bias=True)
)
06/16/2025 05:53:24 PM Sequence Parallel training starting...
rank2 0 get_comm_counts defaultdict(<class 'int'>, {<OpOverloadPacket(op='c10d_functional.all_gather_into_tensor')>: 2, <OpOverloadPacket(op='c10d_functional.reduce_scatter_tensor')>: 1}) get_sharding_info() {'ToyModel.in_proj.weight': (Shard(dim=0),), 'ToyModel.in_proj.bias': (Shard(dim=0),), 'ToyModel.out_proj.weight': (Shard(dim=1),), 'ToyModel.out_proj.bias': (Replicate(),)} generate_comm_debug_tracing_table Global
FORWARD PASS
*c10d_functional.all_gather_into_tensor: 1
*c10d_functional.reduce_scatter_tensor: 1
BACKWARD PASS
*c10d_functional.all_gather_into_tensor: 1
ToyModel
*module type: class '__main__.ToyModel'
FORWARD PASS
*c10d_functional.all_gather_into_tensor: 1
*c10d_functional.reduce_scatter_tensor: 1
BACKWARD PASS
*c10d_functional.all_gather_into_tensor: 1
ToyModel.in_proj
*module type: class 'torch.nn.modules.linear.Linear'
*Parameter List
*weight: (Shard(dim=0),)
*bias: (Shard(dim=0),)
FORWARD PASS
*c10d_functional.all_gather_into_tensor: 1
**aten.addmm.default
shape: [torch.Size([32]), torch.Size([4, 10]), torch.Size([10, 32])]
sharding: [(Shard(dim=0),), (Replicate(),), (Shard(dim=1),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.zeros_like.default
shape: [torch.Size([32, 10])]
sharding: [(Shard(dim=0),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.zeros_like.default
shape: [torch.Size([32, 10])]
sharding: [(Shard(dim=0),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.zeros_like.default
shape: [torch.Size([32])]
sharding: [(Shard(dim=0),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.zeros_like.default
shape: [torch.Size([32])]
sharding: [(Shard(dim=0),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.zeros_like.default
shape: [torch.Size([5, 32])]
sharding: [(Shard(dim=1),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.zeros_like.default
shape: [torch.Size([5, 32])]
sharding: [(Shard(dim=1),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.zeros_like.default
shape: [torch.Size([5])]
sharding: [(Replicate(),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.zeros_like.default
shape: [torch.Size([5])]
sharding: [(Replicate(),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
BACKWARD PASS
**aten.mm.default
shape: [torch.Size([32, 4]), torch.Size([4, 10])]
sharding: [(Shard(dim=0),), (Replicate(),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.sum.dim_IntList
shape: [torch.Size([4, 32])]
sharding: [(Shard(dim=1),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
ToyModel.relu
*module type: class 'torch.nn.modules.activation.ReLU'
FORWARD PASS
BACKWARD PASS
ToyModel.out_proj
*module type: class 'torch.nn.modules.linear.Linear'
*Parameter List
*weight: (Shard(dim=1),)
*bias: (Replicate(),)
FORWARD PASS
*c10d_functional.reduce_scatter_tensor: 1
**aten.addmm.default
shape: [torch.Size([5]), torch.Size([4, 32]), torch.Size([32, 5])]
sharding: [(Replicate(),), (Shard(dim=1),), (Shard(dim=0),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
BACKWARD PASS
*c10d_functional.all_gather_into_tensor: 1
**aten.mm.default
shape: [torch.Size([4, 5]), torch.Size([5, 32])]
sharding: [(Replicate(),), (Shard(dim=1),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.mm.default
shape: [torch.Size([5, 4]), torch.Size([4, 32])]
sharding: [(Replicate(),), (Shard(dim=1),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.sum.dim_IntList
shape: [torch.Size([4, 5])]
sharding: [(Replicate(),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
06/16/2025 05:53:25 PM Sequence Parallel iter 0 completed
rank0 0 get_comm_counts defaultdict(<class 'int'>, {<OpOverloadPacket(op='c10d_functional.all_gather_into_tensor')>: 2, <OpOverloadPacket(op='c10d_functional.reduce_scatter_tensor')>: 1}) get_sharding_info() {'ToyModel.in_proj.weight': (Shard(dim=0),), 'ToyModel.in_proj.bias': (Shard(dim=0),), 'ToyModel.out_proj.weight': (Shard(dim=1),), 'ToyModel.out_proj.bias': (Replicate(),)} generate_comm_debug_tracing_table Global
FORWARD PASS
*c10d_functional.all_gather_into_tensor: 1
*c10d_functional.reduce_scatter_tensor: 1
BACKWARD PASS
*c10d_functional.all_gather_into_tensor: 1
ToyModel
*module type: class '__main__.ToyModel'
FORWARD PASS
*c10d_functional.all_gather_into_tensor: 1
*c10d_functional.reduce_scatter_tensor: 1
BACKWARD PASS
*c10d_functional.all_gather_into_tensor: 1
ToyModel.in_proj
*module type: class 'torch.nn.modules.linear.Linear'
*Parameter List
*weight: (Shard(dim=0),)
*bias: (Shard(dim=0),)
FORWARD PASS
*c10d_functional.all_gather_into_tensor: 1
**aten.addmm.default
shape: [torch.Size([32]), torch.Size([4, 10]), torch.Size([10, 32])]
sharding: [(Shard(dim=0),), (Replicate(),), (Shard(dim=1),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.zeros_like.default
shape: [torch.Size([32, 10])]
sharding: [(Shard(dim=0),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.zeros_like.default
shape: [torch.Size([32, 10])]
sharding: [(Shard(dim=0),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.zeros_like.default
shape: [torch.Size([32])]
sharding: [(Shard(dim=0),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.zeros_like.default
shape: [torch.Size([32])]
sharding: [(Shard(dim=0),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.zeros_like.default
shape: [torch.Size([5, 32])]
sharding: [(Shard(dim=1),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.zeros_like.default
shape: [torch.Size([5, 32])]
sharding: [(Shard(dim=1),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.zeros_like.default
shape: [torch.Size([5])]
sharding: [(Replicate(),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.zeros_like.default
shape: [torch.Size([5])]
sharding: [(Replicate(),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
BACKWARD PASS
**aten.mm.default
shape: [torch.Size([32, 4]), torch.Size([4, 10])]
sharding: [(Shard(dim=0),), (Replicate(),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.sum.dim_IntList
shape: [torch.Size([4, 32])]
sharding: [(Shard(dim=1),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
ToyModel.relu
*module type: class 'torch.nn.modules.activation.ReLU'
FORWARD PASS
BACKWARD PASS
ToyModel.out_proj
*module type: class 'torch.nn.modules.linear.Linear'
*Parameter List
*weight: (Shard(dim=1),)
*bias: (Replicate(),)
FORWARD PASS
*c10d_functional.reduce_scatter_tensor: 1
**aten.addmm.default
shape: [torch.Size([5]), torch.Size([4, 32]), torch.Size([32, 5])]
sharding: [(Replicate(),), (Shard(dim=1),), (Shard(dim=0),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
BACKWARD PASS
*c10d_functional.all_gather_into_tensor: 1
**aten.mm.default
shape: [torch.Size([4, 5]), torch.Size([5, 32])]
sharding: [(Replicate(),), (Shard(dim=1),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.mm.default
shape: [torch.Size([5, 4]), torch.Size([4, 32])]
sharding: [(Replicate(),), (Shard(dim=1),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.sum.dim_IntList
shape: [torch.Size([4, 5])]
sharding: [(Replicate(),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
rank1 0 get_comm_counts defaultdict(<class 'int'>, {<OpOverloadPacket(op='c10d_functional.all_gather_into_tensor')>: 2, <OpOverloadPacket(op='c10d_functional.reduce_scatter_tensor')>: 1}) get_sharding_info() {'ToyModel.in_proj.weight': (Shard(dim=0),), 'ToyModel.in_proj.bias': (Shard(dim=0),), 'ToyModel.out_proj.weight': (Shard(dim=1),), 'ToyModel.out_proj.bias': (Replicate(),)} generate_comm_debug_tracing_table Global
FORWARD PASS
*c10d_functional.all_gather_into_tensor: 1
*c10d_functional.reduce_scatter_tensor: 1
BACKWARD PASS
*c10d_functional.all_gather_into_tensor: 1
ToyModel
*module type: class '__main__.ToyModel'
FORWARD PASS
*c10d_functional.all_gather_into_tensor: 1
*c10d_functional.reduce_scatter_tensor: 1
BACKWARD PASS
*c10d_functional.all_gather_into_tensor: 1
ToyModel.in_proj
*module type: class 'torch.nn.modules.linear.Linear'
*Parameter List
*weight: (Shard(dim=0),)
*bias: (Shard(dim=0),)
FORWARD PASS
*c10d_functional.all_gather_into_tensor: 1
**aten.addmm.default
shape: [torch.Size([32]), torch.Size([4, 10]), torch.Size([10, 32])]
sharding: [(Shard(dim=0),), (Replicate(),), (Shard(dim=1),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.zeros_like.default
shape: [torch.Size([32, 10])]
sharding: [(Shard(dim=0),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.zeros_like.default
shape: [torch.Size([32, 10])]
sharding: [(Shard(dim=0),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.zeros_like.default
shape: [torch.Size([32])]
sharding: [(Shard(dim=0),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.zeros_like.default
shape: [torch.Size([32])]
sharding: [(Shard(dim=0),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.zeros_like.default
shape: [torch.Size([5, 32])]
sharding: [(Shard(dim=1),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.zeros_like.default
shape: [torch.Size([5, 32])]
sharding: [(Shard(dim=1),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.zeros_like.default
shape: [torch.Size([5])]
sharding: [(Replicate(),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.zeros_like.default
shape: [torch.Size([5])]
sharding: [(Replicate(),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
BACKWARD PASS
**aten.mm.default
shape: [torch.Size([32, 4]), torch.Size([4, 10])]
sharding: [(Shard(dim=0),), (Replicate(),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.sum.dim_IntList
shape: [torch.Size([4, 32])]
sharding: [(Shard(dim=1),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
ToyModel.relu
*module type: class 'torch.nn.modules.activation.ReLU'
FORWARD PASS
BACKWARD PASS
ToyModel.out_proj
*module type: class 'torch.nn.modules.linear.Linear'
*Parameter List
*weight: (Shard(dim=1),)
*bias: (Replicate(),)
FORWARD PASS
*c10d_functional.reduce_scatter_tensor: 1
**aten.addmm.default
shape: [torch.Size([5]), torch.Size([4, 32]), torch.Size([32, 5])]
sharding: [(Replicate(),), (Shard(dim=1),), (Shard(dim=0),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
BACKWARD PASS
*c10d_functional.all_gather_into_tensor: 1
**aten.mm.default
shape: [torch.Size([4, 5]), torch.Size([5, 32])]
sharding: [(Replicate(),), (Shard(dim=1),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.mm.default
shape: [torch.Size([5, 4]), torch.Size([4, 32])]
sharding: [(Replicate(),), (Shard(dim=1),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.sum.dim_IntList
shape: [torch.Size([4, 5])]
sharding: [(Replicate(),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
rank3 0 get_comm_counts defaultdict(<class 'int'>, {<OpOverloadPacket(op='c10d_functional.all_gather_into_tensor')>: 2, <OpOverloadPacket(op='c10d_functional.reduce_scatter_tensor')>: 1}) get_sharding_info() {'ToyModel.in_proj.weight': (Shard(dim=0),), 'ToyModel.in_proj.bias': (Shard(dim=0),), 'ToyModel.out_proj.weight': (Shard(dim=1),), 'ToyModel.out_proj.bias': (Replicate(),)} generate_comm_debug_tracing_table Global
FORWARD PASS
*c10d_functional.all_gather_into_tensor: 1
*c10d_functional.reduce_scatter_tensor: 1
BACKWARD PASS
*c10d_functional.all_gather_into_tensor: 1
ToyModel
*module type: class '__main__.ToyModel'
FORWARD PASS
*c10d_functional.all_gather_into_tensor: 1
*c10d_functional.reduce_scatter_tensor: 1
BACKWARD PASS
*c10d_functional.all_gather_into_tensor: 1
ToyModel.in_proj
*module type: class 'torch.nn.modules.linear.Linear'
*Parameter List
*weight: (Shard(dim=0),)
*bias: (Shard(dim=0),)
FORWARD PASS
*c10d_functional.all_gather_into_tensor: 1
**aten.addmm.default
shape: [torch.Size([32]), torch.Size([4, 10]), torch.Size([10, 32])]
sharding: [(Shard(dim=0),), (Replicate(),), (Shard(dim=1),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.zeros_like.default
shape: [torch.Size([32, 10])]
sharding: [(Shard(dim=0),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.zeros_like.default
shape: [torch.Size([32, 10])]
sharding: [(Shard(dim=0),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.zeros_like.default
shape: [torch.Size([32])]
sharding: [(Shard(dim=0),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.zeros_like.default
shape: [torch.Size([32])]
sharding: [(Shard(dim=0),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.zeros_like.default
shape: [torch.Size([5, 32])]
sharding: [(Shard(dim=1),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.zeros_like.default
shape: [torch.Size([5, 32])]
sharding: [(Shard(dim=1),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.zeros_like.default
shape: [torch.Size([5])]
sharding: [(Replicate(),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.zeros_like.default
shape: [torch.Size([5])]
sharding: [(Replicate(),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
BACKWARD PASS
**aten.mm.default
shape: [torch.Size([32, 4]), torch.Size([4, 10])]
sharding: [(Shard(dim=0),), (Replicate(),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.sum.dim_IntList
shape: [torch.Size([4, 32])]
sharding: [(Shard(dim=1),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
ToyModel.relu
*module type: class 'torch.nn.modules.activation.ReLU'
FORWARD PASS
BACKWARD PASS
ToyModel.out_proj
*module type: class 'torch.nn.modules.linear.Linear'
*Parameter List
*weight: (Shard(dim=1),)
*bias: (Replicate(),)
FORWARD PASS
*c10d_functional.reduce_scatter_tensor: 1
**aten.addmm.default
shape: [torch.Size([5]), torch.Size([4, 32]), torch.Size([32, 5])]
sharding: [(Replicate(),), (Shard(dim=1),), (Shard(dim=0),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
BACKWARD PASS
*c10d_functional.all_gather_into_tensor: 1
**aten.mm.default
shape: [torch.Size([4, 5]), torch.Size([5, 32])]
sharding: [(Replicate(),), (Shard(dim=1),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.mm.default
shape: [torch.Size([5, 4]), torch.Size([4, 32])]
sharding: [(Replicate(),), (Shard(dim=1),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
**aten.sum.dim_IntList
shape: [torch.Size([4, 5])]
sharding: [(Replicate(),)]
device mesh: DeviceMesh('cuda', [0, 1, 2, 3])
06/16/2025 05:53:25 PM Sequence Parallel iter 1 completed
06/16/2025 05:53:25 PM Sequence Parallel iter 2 completed
06/16/2025 05:53:25 PM Sequence Parallel iter 3 completed
06/16/2025 05:53:25 PM Sequence Parallel iter 4 completed
06/16/2025 05:53:25 PM Sequence Parallel iter 5 completed
06/16/2025 05:53:25 PM Sequence Parallel iter 6 completed
06/16/2025 05:53:25 PM Sequence Parallel iter 7 completed
06/16/2025 05:53:25 PM Sequence Parallel iter 8 completed
06/16/2025 05:53:25 PM Sequence Parallel iter 9 completed
06/16/2025 05:53:25 PM Sequence Parallel training completed!
[rank0]:[W616 17:53:25.948217933 ProcessGroupNCCL.cpp:1516] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Sorry, I meant on non-CUDA devices: does this API work if you use MPS or CPU?
torch.accelerator works for CUDA as well as non-CUDA GPUs and accelerators. CommDebugMode is also a core PyTorch feature, so it should work on all devices; if not, that would be a bug.
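For illustration, device selection in these examples follows roughly this pattern (a minimal sketch assuming PyTorch 2.6+ where torch.accelerator is available; the exact wiring in the PR may differ):

```python
import torch

# Pick whichever accelerator is present (cuda, xpu, mps, ...); fall back to CPU.
if torch.accelerator.is_available():
    device_type = torch.accelerator.current_accelerator().type
else:
    device_type = "cpu"

inp = torch.rand(4, 10, device=device_type)
```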
@msaroufim, if there are no more questions, could this be merged?
Can you please attach logs confirming this works on CPU?
@msaroufim, I have only used GPUs for this kind of work. The accelerator API does not cover CPUs, and I do not know whether TP and SP are supported on CPUs or, if so, which distributed backend they would use. The original code would not work on CPUs either, as far as I can tell. In summary, these two examples were not written for CPUs, and adding CPU support would be a significant change, if it is possible at all.
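For reference, a common pattern for picking the distributed backend by device type would be something like the sketch below, though I have not verified that these TP/SP examples actually run on gloo/CPU:

```python
import torch
import torch.distributed as dist

# Hypothetical backend selection; gloo-on-CPU support for these examples is untested.
backend = "nccl" if torch.cuda.is_available() else "gloo"
dist.init_process_group(backend=backend)
```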
I guess I'm confused by the goal of this PR overall:
- Why merge a device-agnostic API if the code is only expected to work on a single device? If that's the case, then keeping cuda is actually clearer.
- I'm not sure why CommDebugMode is introduced and why it should be default behavior.
@msaroufim, great questions. Let me address them:
- There are non-CUDA GPUs and accelerators (e.g. XPU, MTIA, HPU). torch.accelerator is a write-once, run-anywhere interface, so the model code runs on any supported accelerator without surgery.
- Since these are distributed examples, a way to see what is happening in the communication layer should be very informative. It could also be gated behind an input option, if that is preferable; see the sketch below.
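A hypothetical sketch of what gating it behind a flag could look like (the --debug-comm option name and the surrounding tp_model/inp/optimizer objects are assumptions, not part of this PR):

```python
import argparse
import contextlib

from torch.distributed.tensor.debug import CommDebugMode

parser = argparse.ArgumentParser()
parser.add_argument("--debug-comm", action="store_true",
                    help="print CommDebugMode tracing for the first iteration")
args = parser.parse_args()

# Only pay the tracing cost when explicitly requested.
ctx = CommDebugMode() if args.debug_comm else contextlib.nullcontext()
with ctx:
    output = tp_model(inp)  # tp_model, inp, optimizer as in the example
    output.sum().backward()
    optimizer.step()

if args.debug_comm:
    print(ctx.generate_comm_debug_tracing_table(noise_level=1))
```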
OK, this makes sense. We can merge this if you fix the breakage in the CI job and make the CommDebugMode output optional in a follow-up PR.
Looks like the failing CUDA test below (Run Distributed Examples / test (pull_request)) runs with a relatively old version of PyTorch (torch==2.4.0.dev20240605+cu11). The upcoming release is 2.8.
Changing cuda to accelerator and adding CommDebugMode to tensor_parallel_example.py, sequence_parallel_example.py, and log_utils.py.